Collaborating Authors

 Aiwo District


AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models

Beux, Yann Le, Audu, Oluchi, Ankeli, Oche D., Balakrishnan, Dhananjay, Weya, Melissah, Ralaiarinosy, Marie D., Ezeani, Ignatius

arXiv.org Artificial Intelligence

Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype-antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p <= 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.
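The abstract does not define the Bias Preference Ratio or name the significance test, so the following is only a minimal sketch under stated assumptions: BPR is taken to be the fraction of stereotype-antistereotype pairs for which a model scores the stereotype higher, and significance is checked with a one-sided exact binomial test against 0.5. The function names and the score format (for example, log-likelihoods) are hypothetical.

```python
from math import comb


def bias_preference_ratio(pairs):
    """Return (ratio, count) of pairs where the stereotype scores higher.

    ASSUMPTION: `pairs` is a list of (stereotype_score, antistereotype_score)
    tuples, e.g. model log-likelihoods. Ties count as "no preference for the
    stereotype". This definition is illustrative, not taken from the paper.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return preferred / len(pairs), preferred


def binomial_p_value(k, n, p=0.5):
    """One-sided exact binomial test: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Made-up scores for three pairs, purely to show the call pattern.
    scores = [(-1.0, -2.0), (-1.0, -3.0), (-5.0, -2.0)]
    ratio, k = bias_preference_ratio(scores)
    print(ratio, binomial_p_value(k, len(scores)))
```

Under this reading, a ratio near 0.5 would indicate no systematic preference, and values such as those reported in the abstract would indicate a preference for stereotypes.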


TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Lim, Jimin, Damerla, Arjun, Jiang, Arthur, Le, Nam

arXiv.org Artificial Intelligence

Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as "you earned a token", without access to numerical cues or explicit probabilities, requiring the model to infer latent reward structures purely from linguistic cues and to adapt accordingly. We evaluate the performance of four open-source LLMs and compare it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed compared to the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, which significantly outperformed both the larger LLMs and traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
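For readers unfamiliar with the baselines named in this abstract, here is a minimal sketch of textbook versions of them on a Bernoulli bandit, reporting the best-arm selection rate. It is not the benchmark's code: the arm probabilities, step count, exploration rate, and all function names are assumptions chosen for illustration.

```python
import math
import random


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def random_choice(pulls, wins, t, rng):
    return rng.randrange(len(pulls))


def epsilon_greedy(pulls, wins, t, rng, eps=0.1):
    # Explore with probability eps, otherwise exploit the best empirical mean.
    if rng.random() < eps:
        return rng.randrange(len(pulls))
    return argmax([w / p if p else 0.0 for w, p in zip(wins, pulls)])


def ucb1(pulls, wins, t, rng):
    # Pull every arm once, then maximize mean + confidence bonus.
    for arm, count in enumerate(pulls):
        if count == 0:
            return arm
    return argmax([w / p + math.sqrt(2.0 * math.log(t) / p)
                   for w, p in zip(wins, pulls)])


def thompson(pulls, wins, t, rng):
    # Sample each arm's success rate from its Beta(1 + wins, 1 + losses) posterior.
    return argmax([rng.betavariate(w + 1, p - w + 1)
                   for w, p in zip(wins, pulls)])


def best_arm_rate(policy, probs, steps, rng):
    """Fraction of steps on which `policy` pulled the truly best arm."""
    pulls, wins = [0] * len(probs), [0] * len(probs)
    best, hits = argmax(probs), 0
    for t in range(1, steps + 1):
        arm = policy(pulls, wins, t, rng)
        pulls[arm] += 1
        wins[arm] += 1 if rng.random() < probs[arm] else 0
        hits += arm == best
    return hits / steps


if __name__ == "__main__":
    probs = [0.2, 0.8]  # hypothetical reward probabilities
    for policy in (random_choice, epsilon_greedy, ucb1, thompson):
        print(policy.__name__, best_arm_rate(policy, probs, 2000, random.Random(0)))
```

In the language-only setting the abstract describes, the numeric reward would be replaced by a textual message and the policy by an LLM prompt; that part is not sketched here.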


Sampling from Gaussian Processes: A Tutorial and Applications in Global Sensitivity Analysis and Optimization

Do, Bach, Ajenifuja, Nafeezat A., Adebiyi, Taiwo A., Zhang, Ruda

arXiv.org Machine Learning

High-fidelity simulations and physical experiments are essential for engineering analysis and design. However, their high cost often limits their applications in two critical tasks: global sensitivity analysis (GSA) and optimization. This limitation motivates the common use of Gaussian processes (GPs) as proxy regression models to provide uncertainty-aware predictions based on a limited number of high-quality observations. GPs naturally enable efficient sampling strategies that support informed decision-making under uncertainty by extracting information from a subset of possible functions for the model of interest. Despite their popularity in machine learning and statistics communities, sampling from GPs has received little attention in the community of engineering optimization. In this paper, we present the formulation and detailed implementation of two notable sampling methods -- random Fourier features and pathwise conditioning -- for generating posterior samples from GPs. Alternative approaches are briefly described. Importantly, we detail how the generated samples can be applied in GSA, single-objective optimization, and multi-objective optimization. We show successful applications of these sampling methods through a series of numerical examples.
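As a rough illustration of the random Fourier features idea mentioned in this abstract, the sketch below draws an approximate sample path from a one-dimensional GP prior with a squared-exponential kernel of unit variance. It is an assumption-laden sketch rather than the tutorial's implementation: the kernel choice, the feature count, and the function names are hypothetical, and conditioning on observations to obtain posterior samples is deliberately left out.

```python
import math
import random


def draw_rff_basis(num_features, lengthscale, rng):
    """Draw frequencies and phases for a squared-exponential kernel.

    For k(x, y) = exp(-(x - y)^2 / (2 * lengthscale^2)) the spectral density
    is Gaussian with standard deviation 1 / lengthscale.
    """
    omegas = [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return omegas, phases


def features(x, omegas, phases):
    """Feature map phi(x) such that phi(x) . phi(y) approximates k(x, y)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(w * x + b) for w, b in zip(omegas, phases)]


def sample_prior_function(omegas, phases, rng):
    """Return f(x) = sum_j weight_j * phi_j(x) with standard normal weights."""
    weights = [rng.gauss(0.0, 1.0) for _ in omegas]

    def f(x):
        return sum(wt * ph for wt, ph in zip(weights, features(x, omegas, phases)))

    return f


if __name__ == "__main__":
    rng = random.Random(0)
    omegas, phases = draw_rff_basis(500, lengthscale=1.0, rng=rng)
    f = sample_prior_function(omegas, phases, rng)
    print([round(f(x / 2.0), 3) for x in range(5)])
```

Because the sample is an explicit function of x, it can be evaluated or optimized at arbitrary inputs, which is what makes such samples convenient for the sensitivity analysis and optimization uses the abstract lists.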


Do Chinese models speak Chinese languages?

Wen-Yi, Andrea W, Jo, Unso Eun Seo, Mimno, David

arXiv.org Artificial Intelligence

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.


Benchmarks as Microscopes: A Call for Model Metrology

Saxon, Michael, Holtzman, Ari, West, Peter, Wang, William Yang, Saphra, Naomi

arXiv.org Artificial Intelligence

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their models have generalized traits such as reasoning or open-domain language understanding based on these flawed metrics. The science and practice of LMs require a new approach to benchmarking which measures specific capabilities with dynamic assessments. To be confident in our metrics, we need a new discipline of model metrology -- one which focuses on how to generate benchmarks that predict performance under deployment. Motivated by our evaluation criteria, we outline how building a community of model metrology practitioners -- one focused on building tools and studying how to measure system capabilities -- is the best way to meet these needs and to add clarity to the AI discussion.


Top Innovative Artificial Intelligence (AI) Powered Startups Based in Finland (2022)

#artificialintelligence

Artificial intelligence is experiencing exponential growth and is being used by thousands of businesses worldwide. It is easing our daily lives and offering solutions to the most challenging issues. Let's look at some of the most cutting-edge AI startups established in Finland. Although digital or online learning is developing quickly, it still has many shortcomings, including a lack of simplicity and personalization. Claned is a personalized online learning platform revolutionizing the digital learning arena.


50 AI & Machine Learning startups to watch in Finland

#artificialintelligence

Recently, I've curated a list of 50 Finnish startups in the field of AI & Machine Learning for those who are looking for business partners or companies to invest in. If you are an international investor who wants to connect with one of the startups, feel free to drop me a message. I can make the intro and provide the companies' investor pitch deck to you if it is available. You can also use Finder.fi to check the company's revenue development. If you are an ambitious entrepreneur (based in Finland) who is working on the next world-changing idea and is looking for funding, let's meet! I'll be happy to discuss how we can help you with the fundraising process (for free). Most of the following information is from the company website. But if you've spotted an error, please let me know and I will revise accordingly. "AISpotter has developed a time-saving, fast service for coaches all around the world. Our goal is to combine high-end technology and sports of any kind. With real-time analysis, coaches and teams are given the power to be one step ahead in team development." "We've taken over 30 years of recognized University of Oulu Machine Vision Group technology and adapted it to improve your sports game. By combining state-of-the-art machine learning and computer vision in our unique way, we provide automatic and fast analysis service for your game."


Constructing Conditional Plans by a Theorem-Prover

Rintanen, J.

Journal of Artificial Intelligence Research

The research on conditional planning rejects the assumptions that there is no uncertainty or incompleteness of knowledge with respect to the state and changes of the system the plans operate on. Without these assumptions the sequences of operations that achieve the goals depend on the initial state and the outcomes of nondeterministic changes in the system. This setting raises the questions of how to represent the plans and how to perform plan search. The answers are quite different from those in the simpler classical framework. In this paper, we approach conditional planning from a new viewpoint that is motivated by the use of satisfiability algorithms in classical planning. Translating conditional planning to formulae in the propositional logic is not feasible because of inherent computational limitations. Instead, we translate conditional planning to quantified Boolean formulae. We discuss three formalizations of conditional planning as quantified Boolean formulae, and present experimental results obtained with a theorem-prover.